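Extracting knowledge from unlabeled text with machine learning algorithms can be complex. Document categorization and information retrieval are two applications that can benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm raises reproducibility issues. Depending on the machine learning algorithm, initialization can introduce variability. In addition, distortions in cluster geometry can be misleading. Among the causes, the presence of outliers and anomalies can be a determining factor. Although initialization and outlier issues are relevant to text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology, since similar procedures go by different names. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, factorization, and clustering algorithms that are directly or indirectly related to the reviewed works.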
translated by 谷歌翻译
Automatic Text Summarization (ATS) is becoming more relevant as textual data grows. With the popularization of public large-scale datasets, however, some recent machine learning approaches have focused on dense models and architectures that, despite producing notable results, usually yield models that are difficult to interpret. Given the challenge of interpretable learning-based text summarization and the importance it may have for advancing the current state of the ATS field, this work studies the application of two modern Generalized Additive Models with interactions, namely Explainable Boosting Machine and GAMI-Net, to the extractive summarization problem based on linguistic features and binary classification.
Early recognition of clinical deterioration (CD) is vitally important for protecting patients from exacerbation or death. Electronic health record (EHR) data have been widely employed in Early Warning Scores (EWS) to measure CD risk in hospitalized patients. Recently, EHR data have been used in Machine Learning (ML) models to predict mortality and CD. The ML models have shown superior performance in CD prediction compared to EWS. Since EHR data are structured and tabular, conventional ML models are generally applied to them, and less effort has gone into evaluating the performance of artificial neural networks on EHR data. Thus, in this article, an extremely boosted neural network (XBNet) is used to predict CD, and its performance is compared to eXtreme Gradient Boosting (XGBoost) and random forest (RF) models. For this purpose, 103,105 samples from thirteen Brazilian hospitals are used to generate the models. Moreover, principal component analysis (PCA) is employed to verify whether it can improve the adopted models' performance. The performance of the ML models and of the Modified Early Warning Score (MEWS), an EWS candidate, is evaluated in CD prediction in terms of accuracy, precision, recall, F1-score, and geometric mean (G-mean) in a 10-fold cross-validation approach. According to the experiments, the XGBoost model obtained the best results in predicting CD on the Brazilian hospitals' data.
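The abstract lists G-mean among its metrics without defining it. As a minimal sketch, assuming the definition commonly used for imbalanced binary classification (the square root of sensitivity times specificity), it can be computed as below. The function name `g_mean` and the label convention (1 = deterioration, 0 = no deterioration) are illustrative assumptions, not taken from the paper.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels.

    Assumes 1 marks the positive class and 0 the negative class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    # Guard against a class that is absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical labels: 3 of 4 positives and 3 of 4 negatives are correct.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
score = g_mean(y_true, y_pred)
```

Unlike accuracy, this quantity drops to zero when a model ignores one class entirely, which is presumably why it is reported alongside the other metrics on clinical data where deterioration events are rare.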
Some recent work in the Machine Learning (ML) literature has demonstrated the usefulness of assessing which observations have labels that are hardest to predict accurately. By identifying such instances, one may inspect whether they have any quality issues that should be addressed. Learning strategies based on the difficulty level of the observations can also be devised. This paper presents a set of meta-features, known as instance hardness measures, that aim to characterize which instances of a dataset have labels that are hardest to predict accurately, and why. Both classification and regression problems are considered. Synthetic datasets with different levels of complexity are built and analyzed. A Python package containing all implementations is also provided.
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
Chronic pain is a multi-dimensional experience, and pain intensity plays an important part, impacting the patient's emotional balance, psychology, and behaviour. Standard self-reporting tools, such as the Visual Analogue Scale for pain, fail to capture this burden. Moreover, this type of tool is susceptible to a degree of subjectivity, dependent on the patient's clear understanding of how to use it, social biases, and their ability to translate a complex experience to a scale. To overcome these and other self-reporting challenges, pain intensity estimation has been previously studied based on facial expressions, electroencephalograms, brain imaging, and autonomic features. However, to the best of our knowledge, no one has attempted to base this estimation on patient narratives of the personal experience of chronic pain, which is what we propose in this work. Indeed, in the clinical assessment and management of chronic pain, verbal communication is essential to convey information to physicians that would otherwise not be easily accessible through standard reporting tools, since language, sociocultural, and psychosocial variables are intertwined. We show that language features from patient narratives convey information relevant for pain intensity estimation, and that our computational models can take advantage of that. Specifically, our results show that patients with mild pain focus more on the use of verbs, whilst moderate and severe pain patients focus on adverbs, and on nouns and adjectives, respectively, and that these differences allow for the distinction between these three pain classes.
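Online event-based perception techniques running on board robots that navigate complex, unstructured, and dynamic environments can suffer unpredictable changes in the incoming event rate and in processing time, which can cause computational overflow or loss of responsiveness. This paper presents ASAP: a novel event handling framework that adapts the transmission of events to the processing algorithm, keeping the system responsive and preventing overflows. ASAP is composed of two adaptive mechanisms. The first prevents event processing overflows by discarding an adaptive percentage of the incoming events. The second dynamically adapts the size of the event packages to reduce the delay between event generation and processing. ASAP has guaranteed convergence and is flexible with respect to the processing algorithm. It has been validated on board under challenging conditions.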
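The two adaptive mechanisms can be illustrated with a small sketch. The update rules below are assumptions chosen for illustration (a proportional step toward the discard fraction needed to match processing capacity, and a rescaling of package size toward a target latency); they are not the control laws of the ASAP paper, and every name and parameter here is hypothetical.

```python
def update_discard_ratio(current, event_rate, capacity, gain=0.5):
    """Move the discard fraction toward the value needed to avoid overflow.

    event_rate: incoming events per second.
    capacity:   events per second the processing algorithm can handle.
    gain:       hypothetical step size in (0, 1].
    """
    if event_rate <= 0:
        needed = 0.0
    else:
        # Fraction of events that must be dropped so the rest fit in capacity.
        needed = max(0.0, 1.0 - capacity / event_rate)
    updated = current + gain * (needed - current)
    return min(1.0, max(0.0, updated))


def update_package_size(current, package_proc_time, target_latency,
                        min_size=1, max_size=10_000):
    """Rescale the package size so one package takes about target_latency.

    package_proc_time: measured seconds spent processing the last package.
    target_latency:    hypothetical desired seconds per package.
    """
    if package_proc_time <= 0:
        return current
    scaled = int(round(current * target_latency / package_proc_time))
    return min(max_size, max(min_size, scaled))


# Hypothetical readings: events arrive twice as fast as they can be processed,
# and the last 200-event package took twice the target latency.
ratio = update_discard_ratio(0.0, event_rate=2000.0, capacity=1000.0, gain=1.0)
size = update_package_size(200, package_proc_time=0.02, target_latency=0.01)
```

Keeping the two updates separate mirrors the description above: one acts on how many events are admitted, the other on how they are grouped before processing.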
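Event cameras capture pixel-level illumination changes with very high temporal resolution and dynamic range. They have attracted increasing research interest because of their robustness to lighting conditions and motion blur. Two main approaches exist in the literature for feeding event-based processing algorithms: packaging the triggered events into event packages, and sending them one by one as single events. These approaches are limited by either processing overflow or lack of responsiveness. Processing overflow is caused by high event generation rates, when the algorithm cannot process all the events in real time. Conversely, lack of responsiveness occurs at low event generation rates, when event packages are sent at too low a frequency. This paper presents ASAP, an adaptive scheme that manages the event stream through variable-size packages that accommodate the event package processing times. The experimental results show that ASAP can feed an asynchronous event clustering algorithm in a responsive and efficient manner while also preventing overflow.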
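We study the problem of graph structure identification, i.e., recovering the graph of dependencies among time series. We model these time series as components of the state of a linear stochastic networked dynamical system. We assume partial observability, where the state evolution of only a subset of the nodes comprising the network is observed. We devise a new feature vector computed from the observed time series and prove that these features are linearly separable, i.e., there exists a hyperplane that separates the cluster of features associated with connected pairs of nodes from the cluster associated with disconnected pairs. This makes the features amenable to training a variety of classifiers for causal inference. In particular, we use these features to train Convolutional Neural Networks (CNNs). The resulting causal inference mechanism outperforms the state of the art w.r.t. sample complexity. The trained CNNs generalize well over structurally distinct networks (dense or sparse) and noise-level profiles. Remarkably, they also generalize well to real-world networks when trained over a synthetic network (a realization of a random graph). Finally, the proposed method consistently reconstructs the graph in a pairwise manner, that is, by deciding from the corresponding time series of each pair of nodes whether an edge or arrow is present or absent. This fits the framework of large-scale systems, where observing or processing all the nodes in the network is prohibitive.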
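The exponential growth of machine learning (ML) has prompted great interest in quantifying the uncertainty of each prediction for a user-defined level of confidence. Reliable uncertainty quantification is crucial and is a step towards increased trust in AI results. It becomes especially important in high-stakes decision-making, where the true output must lie within the confidence set with high probability. Conformal prediction (CP) is a distribution-free uncertainty quantification framework that works with any black-box model and yields prediction intervals (PIs) that are valid under the mild assumption of exchangeability. CP-type methods are gaining popularity because they are easy to implement and computationally cheap; however, the exchangeability assumption immediately excludes time series forecasting. Although recent papers address covariate shift, this is not enough for the general time series forecasting problem of producing H-step-ahead valid PIs. To attain such a goal, we propose a new method called AENBMIMOCQR (Adaptive ensemble batch multi-input multi-output conformalized quantile regression), which produces asymptotically valid PIs and is suitable for heteroscedastic time series. We compare the proposed method against state-of-the-art competing methods on the NN5 forecasting competition dataset. All the code and data for reproducing the experiments are available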
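For orientation, the basic split conformal procedure that CP-type methods build on can be sketched in a few lines. This is the standard exchangeable-data construction with absolute residuals as conformity scores, not the AENBMIMOCQR method proposed in the abstract; the function names, the toy model, and the calibration data are all hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest conformity score.

    If that rank exceeds n, no finite interval has the requested coverage,
    so infinity is returned.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Prediction interval around predict(x_new) from held-out residuals.

    predict: any fitted black-box model, called as predict(x) -> float.
    calib_x, calib_y: calibration data NOT used to fit the model.
    """
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


# Hypothetical black-box model and calibration set.
model = lambda x: 2.0 * x
calib_x = [1.0, 2.0, 3.0, 4.0]
calib_y = [2.5, 3.0, 6.0, 8.25]
low, high = split_conformal_interval(model, calib_x, calib_y, x_new=5.0, alpha=0.5)
```

The coverage guarantee of this construction relies on the calibration points and the new point being exchangeable, which is exactly the assumption the abstract says fails for time series.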